line break
- Leisure & Entertainment > Sports > Martial Arts (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- (13 more...)
so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs
Bhyravajjula, Sriharsh, Walsh, Melanie, Preus, Anna, Antoniak, Maria
Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Iowa (0.04)
- North America > United States > Indiana (0.04)
- (14 more...)
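The abstract's final claim, that different text-processing methods yield significantly different representations of whitespace, can be made concrete with a short sketch. The poem fragment and the normalization step below are invented illustrations, not the paper's actual pipeline:

```python
# A minimal illustration of how a common pretraining-style cleaning step
# (collapsing all whitespace runs) erases the poetic form that the paper
# treats as a semantic and spatial feature. The poem text is invented.
poem = "so much depends\nupon\n\n    a red wheel\n    barrow"

def whitespace_profile(text):
    """Count the whitespace features that carry poetic form."""
    lines = text.split("\n")
    return {
        "newlines": text.count("\n"),
        "blank_lines": sum(1 for line in lines if not line.strip()),
        "indented_lines": sum(1 for line in lines if line.startswith((" ", "\t"))),
    }

# Typical normalization: split on any whitespace, rejoin with single spaces.
normalized = " ".join(poem.split())

print(whitespace_profile(poem))        # line breaks, stanza break, indentation present
print(whitespace_profile(normalized))  # all structure collapsed away
```

Any corpus assembled with the second representation can no longer distinguish a prose sentence from an indented, stanza-broken poem.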
Analysis of Line Break prediction models for detecting defensive breakthrough in football
Yagi, Shoma, Ichikawa, Jun, Ichinose, Genki
In football, attacking teams attempt to break through the opponent's defensive line to create scoring opportunities. This action, known as a Line Break, is a critical indicator of offensive effectiveness and tactical performance, yet previous studies have mainly focused on shots or goal opportunities rather than on how teams break the defensive line. In this study, we develop a machine learning model to predict Line Breaks using event and tracking data from the 2023 J1 League season. The model incorporates 189 features, including player positions, velocities, and spatial configurations, and employs an XGBoost classifier to estimate the probability of Line Breaks. The proposed model achieved high predictive accuracy, with an AUC of 0.982 and a Brier score of 0.015. Furthermore, SHAP analysis revealed that factors such as offensive player speed, gaps in the defensive line, and offensive players' spatial distributions significantly contribute to the occurrence of Line Breaks. Finally, we found a moderate positive correlation between the predicted probability of being Line-Broken and the number of shots and crosses conceded at the team level. These results suggest that Line Breaks are closely linked to the creation of scoring opportunities and provide a quantitative framework for understanding tactical dynamics in football.
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.05)
- Asia > Japan > Honshū > Chūbu > Shizuoka Prefecture > Shizuoka (0.04)
- (5 more...)
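Two of the feature families the SHAP analysis above highlights, offensive player speed and gaps in the defensive line, are simple to compute from tracking data. The sketch below is a toy illustration with invented coordinates, not the paper's 189-feature pipeline:

```python
# Toy feature extraction for Line Break prediction: attacker speed from two
# consecutive frames, and the widest lateral gap in a back four. All player
# positions, the frame interval, and the pitch coordinates are invented.
import math

# (x, y) positions of the four defenders forming the defensive line
defenders = [(30.0, 5.0), (30.5, 25.0), (29.5, 45.0), (30.0, 65.0)]

# attacker position at two consecutive frames, 0.1 s apart
attacker_t0, attacker_t1 = (25.0, 30.0), (25.8, 30.2)
dt = 0.1

def speed(p0, p1, dt):
    """Straight-line speed between two frames, in m/s."""
    return math.dist(p0, p1) / dt

def widest_gap(defenders):
    """Largest lateral distance between adjacent defenders in the line."""
    ys = sorted(y for _, y in defenders)
    return max(b - a for a, b in zip(ys, ys[1:]))

features = {
    "attacker_speed": speed(attacker_t0, attacker_t1, dt),
    "defensive_line_gap": widest_gap(defenders),
}
print(features)
```

In the paper these kinds of features feed an XGBoost classifier; here the point is only that each feature is a cheap geometric function of a single frame (or frame pair) of tracking data.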
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kuznetsov, Kristian, Kushnareva, Laida, Druzhinina, Polina, Razzhigaev, Anton, Voznyuk, Anastasia, Piontkovskaya, Irina, Burnaev, Evgeny, Barannikov, Serguei
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAEs) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
- North America > United States (0.93)
- Atlantic Ocean (0.47)
- Europe (0.28)
- (2 more...)
- Media (0.68)
- Health & Medicine > Therapeutic Area (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
- Education > Educational Setting (0.46)
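The core SAE operation the abstract relies on, projecting a residual-stream activation through an overcomplete ReLU encoder so that only a few features fire, fits in a few lines. This is a generic sketch with hand-picked toy weights, not the paper's Gemma-2-2b setup:

```python
# Minimal sparse-autoencoder encoder pass: features = ReLU(W_enc @ x + b_enc).
# The 3-dimensional "residual stream" vector and the 5-feature dictionary
# below are invented toy values chosen so that most features stay at zero.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x = [1.0, -0.5, 0.25]          # toy residual-stream activation
W_enc = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 1.0],
]
b_enc = [-0.1] * 5             # negative bias pushes weak features to zero

features = relu([h + b for h, b in zip(matvec(W_enc, x), b_enc)])
active = [i for i, f in enumerate(features) if f > 0]
print(features, active)
```

The sparsity is the point: because only a handful of the overcomplete dictionary's features are active for any given input, each active feature can be inspected and interpreted individually, which is what enables the domain statistics and steering analyses described above.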
Predicting Punctuation in Ancient Chinese Texts: A Multi-Layered LSTM and Attention-Based Approach
Cai, Tracy, Chang, Kimmy, Nabi, Fahad
Many ancient Chinese texts contain thousands of lines with no distinct punctuation marks or delimiters in sight. The lack of punctuation in such texts makes it difficult for humans to identify when there are pauses or breaks between particular phrases and to understand the semantic meaning of the written text (Mogahed, 2012). As a result, unless one was educated in the ancient time period, many readers of ancient Chinese would have significantly different interpretations of the texts. We propose an approach to predict the location (and type) of punctuation in ancient Chinese texts that extends the work of Oh et al. (2017) by leveraging a bidirectional multi-layered LSTM with a multi-head attention mechanism, as inspired by Luong et al.'s (2015) discussion of attention-based architectures. We find that the use of multi-layered LSTMs and multihead [...]
In fact, previous approaches have experimented with Encoder-Decoder RNNs, GRUs, and LSTMs, as well as different single-headed attention structures (local and global), to successfully conduct language translation tasks. One recent work that built an efficient model for optimal performance in a task similar to ours (predicting line breaks) is that of Oh et al. (2017), in which researchers were able to predict where line breaks ought to be in Hanmun (a punctuation-lacking Korean script) with a multi-layered LSTM model that incorporated an end-of-sentence attention mechanism. As Luong et al. (2015) found local attention models to significantly outperform non-attentional ones on English-German translation tasks, we were inspired to improve upon Oh et al.'s approach to line-break prediction by paying special attention to the attention model.
- North America > United States > California > Santa Clara County > Palo Alto (0.05)
- Asia > China (0.04)
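The data-preparation side of the task above, turning a punctuated reference text into per-character training labels, is straightforward to sketch. The example sentence, punctuation set, and label names below are illustrative and not taken from the paper:

```python
# Punctuation restoration framed as per-character labeling: strip the marks
# from a punctuated reference text and label each remaining character with
# the mark (if any) that followed it. A model such as the paper's
# LSTM-with-attention can then be trained to predict these labels.
PUNCT = {"，": "comma", "。": "stop"}

def to_labeled_chars(text):
    chars, labels = [], []
    for ch in text:
        if ch in PUNCT:
            labels[-1] = PUNCT[ch]  # attach the mark to the preceding character
        else:
            chars.append(ch)
            labels.append("none")
    return chars, labels

chars, labels = to_labeled_chars("學而時習之，不亦說乎。")
print(list(zip(chars, labels)))
```

At inference time the model receives only the unpunctuated character sequence and predicts a label per position, which is exactly the "location (and type)" formulation the abstract describes.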
Lyrics Transcription for Humans: A Readability-Aware Benchmark
Cífka, Ondřej, Schreiber, Hendrik, Miner, Luke, Stöter, Fabian-Robert
Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Rhode Island (0.04)
- Europe > Greece (0.04)
- (6 more...)
- Media > Music (0.88)
- Leisure & Entertainment (0.66)
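The words-only evaluation the benchmark moves beyond can be demonstrated directly: under a standard word error rate (WER), two transcripts that differ only in punctuation, casing, and line breaks are indistinguishable. This is a generic sketch, not the benchmark's metric code, and the lyric fragment is invented:

```python
# Word error rate over word tokens only: Levenshtein distance divided by
# reference length. Formatting differences vanish in the tokenization step.
def wer(ref_words, hyp_words):
    """Levenshtein distance over word tokens / reference length."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref_words)

def words_only(text):
    """Strip punctuation and casing, keep only word tokens."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

ref = "Hello, darkness,\nmy old friend"
hyp = "hello darkness my old friend"
print(wer(words_only(ref), words_only(hyp)))  # 0.0: formatting loss is invisible
```

A metric that scores 0.0 here cannot reward a system for recovering the line break or the commas, which is the gap the readability-aware metrics above are designed to close.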
Shimo Lab at "Discharge Me!": Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections
He, Yunzhen, Yamagiwa, Hiroaki, Shimodaira, Hidetoshi
In this paper, we present our approach to the shared task "Discharge Me!" at the BioNLP Workshop 2024. The primary goal of this task is to reduce the time and effort clinicians spend on writing detailed notes in the electronic health record (EHR). Participants develop a pipeline to generate the "Brief Hospital Course" and "Discharge Instructions" sections from the EHR. Our approach involves a first step of extracting the relevant sections from the EHR. We then add explanatory prompts to these sections and concatenate them with separate tokens to create the input text. To train a text generation model, we perform LoRA fine-tuning on the ClinicalT5-large model. On the final test data, our approach achieved a ROUGE-1 score of $0.394$, which is comparable to the top solutions.
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- North America > United States > Pennsylvania (0.04)
- (8 more...)
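The input-construction step described above, adding explanatory prompts to the extracted EHR sections and concatenating them with separator tokens, can be sketched in a few lines. The section names, prompt wording, and separator token below are illustrative assumptions, not the authors' exact choices:

```python
# Prompt-driven concatenation of EHR sections into a single model input.
SEP = "<sec>"  # illustrative separator token, not necessarily the paper's

def build_input(sections):
    """Prefix each extracted section with an explanatory prompt, then join."""
    parts = [
        f"The following is the {name} section: {text}"
        for name, text in sections.items()
    ]
    return f" {SEP} ".join(parts)

sections = {
    "Chief Complaint": "chest pain",
    "History of Present Illness": "3 days of intermittent chest pain",
}
model_input = build_input(sections)
print(model_input)
```

The resulting string is what the LoRA-fine-tuned ClinicalT5-large model would consume as its source text for generating the target section.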
Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark
Cífka, Ondřej, Dimitriou, Constantinos, Wang, Cheng-i, Schreiber, Hendrik, Miner, Luke, Stöter, Fabian-Robert
Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics including formatting and punctuation, which leads to a potential misalignment with the creative products of musicians and songwriters as well as listeners' experiences. For example, line breaks are important in conveying information about rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. Firstly, a complete revision of the transcripts, geared specifically towards ALT evaluation by following a newly created annotation guide that unifies the music industry's guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Secondly, a suite of evaluation metrics designed, unlike the traditional word error rate, to capture such phenomena. We hope that the proposed benchmark contributes to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.
- North America > United States > Rhode Island (0.04)
- Europe > Greece (0.04)
- Europe > United Kingdom > England > East Sussex > Brighton (0.04)
- (4 more...)
- Media > Music (1.00)
- Leisure & Entertainment (0.67)
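One way to score line breaks separately from word content, in the spirit of the metric suite described above though not the benchmark's actual implementation, is to compare the word positions after which each transcript places a break. The lyric fragment below is invented:

```python
# Toy line-break F1: when reference and hypothesis contain the same word
# sequence, compare the sets of word indices that end a line.
def break_positions(text):
    """Indices of words that end a line."""
    positions, i = set(), 0
    for line in text.split("\n"):
        i += len(line.split())
        positions.add(i - 1)
    return positions

def break_f1(ref, hyp):
    r, h = break_positions(ref), break_positions(hyp)
    tp = len(r & h)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(h), tp / len(r)
    return 2 * prec * rec / (prec + rec)

ref = "hello darkness\nmy old friend"
hyp = "hello darkness my\nold friend"
print(break_f1(ref, hyp))  # 0.5: one of two breaks placed correctly
```

Unlike plain WER, this kind of score distinguishes a transcript with correct line structure from one with the same words arbitrarily wrapped, which matters for the subtitle and karaoke use cases the abstract mentions.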